A Comparative Study of Text Preprocessing Approaches for Topic Detection of User Utterances

نویسندگان

  • Roman B. Sergienko
  • Muhammad Shan
  • Wolfgang Minker
چکیده

The paper describes a comparative study of existing and novel text preprocessing and classification techniques for domain detection of user utterances. Two corpora are considered. The first one contains customer calls to a call centre for further call routing; the second one contains answers of call centre employees with different kinds of customer orientation behaviour. Seven different unsupervised and supervised term weighting methods were applied. The collective use of term weighting methods is proposed for classification effectiveness improvement. Four different dimensionality reduction methods were applied: stop-words filtering with stemming, feature selection based on term weights, feature transformation based on term clustering, and a novel feature transformation method based on terms belonging to classes. As classification algorithms we used k-NN and a SVM-based algorithm. The numerical experiments have shown that the simultaneous use of the novel proposed approaches (collectives of term weighting methods and the novel feature transformation method) allows reaching the high classification results with very small number of features.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A review of text mining approaches and their function in discovering and extracting a topic

Background and aim: Four text mining methods are examined and focused on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature of text mining and topic modeling.  Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...

متن کامل

Parameters for Topic Boundary Detection in Multi-Party Dialogues

We present a topic boundary detection method that searches for connections between sequences of utterances in multi party dialogues. The connections are established based on word identity. We compare our method to a state-of-the art automatic topic boundary detection method that was also used on multi party dialogues. We checked various methods of preprocessing of the data, including stemming, ...

متن کامل

Qualitative and Quantitative Examination of Text Type Readabilities: A Comparative Analysis

This study compared 2 main approaches to readability assessment. Thequantitative approach applied idea density based on part of speech tagging andcompared 3 sets of text types (i.e., narrative, expository, and argumentative) withrespect to their ease of reading. The qualitative approach was done throughdeveloping questionnaires measuring intermediate EFL learners’ perceptions oncontent, motivat...

متن کامل

A confirmatory study of Differential Item Functioning on EFL reading comprehension

The  present  study  aimed  at  investigating  DIF  sources  on  an  EFL  reading  comprehension test.  Accordingly,  2  DIF  detection  methods,  logistic  regression  (LR)  and  item  response theory  (IRT),  were  used  to  flag  emergent  DIF  of  203  (110  females  &  93  males)  Iranian EFL examinees’ performance on a reading comprehension test. Seven hypothetical DIF sources were examin...

متن کامل

Paraphrasing the meaning of physical environment comparative examining of audience-oriented, author-oriented and text-oriented (Islamic) approaches

Discussions derived from epistemology and its sub-branches are of the most important theoretical grounds affecting theoretical basis of art schools in particular architecture styles. During recent decades, two approaches of epistemology have been reciprocally shaped to know how one can paraphrase the meaning of physical environment. In the first approach, the audience and his knowledge are main...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016